Characterizing superspreading potential of infectious disease: Decomposition of individual transmissibility

In the context of infectious disease transmission, high heterogeneity in individual infectiousness indicates that a few index cases can generate large numbers of secondary cases, a phenomenon commonly known as superspreading. The potential of disease superspreading can be characterized by describing the distribution of secondary cases (of each seed case) as a negative binomial (NB) distribution with the dispersion parameter, k. Based on the feature of NB distribution, there must be a proportion of individuals with individual reproduction number of almost 0, which appears restricted and unrealistic. To overcome this limitation, we generalized the compound structure of a Poisson rate and included an additional parameter, and divided the reproduction number into independent and additive fixed and variable components. Then, the secondary cases followed a Delaporte distribution. We demonstrated that the Delaporte distribution was important for understanding the characteristics of disease transmission, which generated new insights distinct from the NB model. By using real-world dataset, the Delaporte distribution provides improvements in describing the distributions of COVID-19 and SARS cases compared to the NB distribution. The model selection yielded increasing statistical power with larger sample sizes as well as conservative type I error in detecting the improvement in fitting with the likelihood ratio (LR) test. Numerical simulation revealed that the control strategy-making process may benefit from monitoring the transmission characteristics under the Delaporte framework. Our findings highlighted that for the COVID-19 pandemic, population-wide interventions may control disease transmission on a general scale before recommending the high-risk-specific control strategies.


Introduction
The response to infectious disease epidemics can be improved by understanding the characteristics defining the potential to transmit infections between individuals [1]. An intriguing aspect of infectious disease transmission is the circumstances under which the etiological agent is transmitted to a large number of secondary cases from merely a proportion of primary cases [2][3][4][5][6]. The number of secondary transmissions per index case shows levels of heterogeneity [7], while overdispersion refers to transmission with high heterogeneity [8]. Such situations are considered consequences of heterogeneity in individual infectiousness and stochasticity in disease transmission [9,10] as documented by numerous superspreading events [3,[11][12][13][14][15][16]. For example, superspreading potentials and traceable events of COVID-19 transmission have frequently been reported in terms of a scale of k estimates [12,[17][18][19], which appear similar to those of previous epidemics of SARS and Middle East respiratory syndrome coronavirus (MERS-CoV) [5,[20][21][22]. The heterogeneity in transmission is determined by many factors including the characteristics of the host and the pathogen [23], the mode and setting of transmission [17,24], the contact patterns [25], the viability of the pathogen, and the environmental components [8,[26][27][28]. Risk management and disease control strategies may vary and may be adjusted in response to different levels of individual heterogeneity in transmission [11,[29][30][31]. Thus, methods used to characterize heterogeneity in transmission are a public health priority to better understand patterns in infectious disease transmission [32] and in specifying informed control strategies [29,[33][34][35][36].
On one hand, the reproduction number R is commonly adopted to measure the average (or expected) number of secondary cases generated by a typical infectious individual [37]. The scales of R were sometimes given unwarranted priority in the assessment of pandemic potential [2,38,39], which means that R cannot reflect the scale of heterogeneity in individual infectiousness [40][41][42][43]. On the other hand, by acknowledging the heterogeneity in disease transmission patterns, a negative binomial (NB) distribution has been widely applied as a model for count data [44], particularly for offspring cases data that exhibit overdispersion [29], that is, with variance that is greater than the mean values. As such, the heterogeneity in transmission can be quantified by describing the distribution of secondary cases generated by each index case as the NB distribution with dispersion parameter, k [44]. The conceptualization of a NB distribution incorporates the stochastic effects of disease transmission [9] and the variability in individual infectiousness [29]. Mathematically, the framework for the NB distribution was formulated by compounding a Poisson distribution with a Gamma-distributed rate parameter, where the dispersion parameter k accounts for the variation in individual infectiousness reflected in the Gamma distribution [45]. This NB framework was widely adopted and yielded better fitting performance (against the Poisson distribution) in governing real-world observations of offspring cases or cluster sizes [17,29,40,46]. A smaller k value suggests that transmission is more dispersive, and therefore outbreaks are likely to involve superspreading events [3]. When R is fixed, a smaller k corresponds to a lower effectiveness of nonpharmaceutical interventions in controlling epidemics [30,47].
Regarding the description of heterogeneity in transmission from a theoretical standpoint, candidate models have been compared based on their fitting performances to real-world observations [29]. Inspired by the compounding relationship between Poisson and NB distributions, we considered that the composition of the Poisson rate can be modelled using a more generalized framework. In this study, to explain the heterogeneity in the distribution of offspring, we propose the application of the Delaporte distribution, which is a generalized version of the NB distribution and can also be derived by compounding the Poisson rate [48,49]. By fitting several datasets of offspring (or secondary) cases, we illustrated that the Delaporte distribution led to an improved or equivalent fitting performance compared to the NB distribution, and this improvement becomes more evident as the sample size increases. For model selection using the likelihood ratio (LR) test, the Delaporte distribution demonstrated increasing statistical power but a conservative type I error rate for a wide range of sample sizes. We highlight the potential of the Delaporte distribution in quantifying the superspreading characteristics of infectious diseases and for recommending disease control strategies.

Decomposition of the variation in individual infectiousness
Following the classic theoretical framework of disease transmission [9], stochastic effects in transmission are considered to have a Poisson distribution [50], which is denoted X~Poisson(λ). Here, the random variable X denotes the number of secondary cases caused by a randomly-selected primary case, and the parameter λ is the Poisson rate. To account for the variation in individual infectiousness, the Poisson rate λ is a variable attribute among different hosts, and thus the distribution of X becomes a Poisson mixture, as proposed previously in [29].
We then decomposed the offspring number (X) of each index case into two components, including a fixed part (X F ) and variable part (X V ), such that X F + X V = X. Here, X F and X V were assumed to be independent variables and followed the compound Poisson distributions with rate parameters (λ F and λ V ) that followed two Gamma distributions, so that λ F~G amma (mean = R F , dispersion = k F ), and λ V~G amma(mean = R V , dispersion = k V ). This was equivalent to the Poisson rate λ that was directly decomposed into two independent additive components denoted by λ = λ F + λ V [49], where both λ F and λ V are nonnegative values. As such, X is the sum of two independent negative-binomial distributed variables. Referring to the definition in [29], λ was conceptualized as the individual reproduction number [51], which is a random variable and represents the expected number of secondary cases caused by a (particularly) given primary case.
For the fixed component (X F ), we modelled k F ! 1 assuming there was no variation in the fixed part (λ F ) of individual infectiousness. By denoting the probability mass function (PMF) of X as f D (X), the probability of generating function (PGF), g D (�), was as follows: Because the term k F vanishes, we denoted k V by k for convenience. The λ F is the fixed component, which is a constant, and we defined R F = λ F . The λ V is the variable component, which follows a Gamma distribution with a mean R V and dispersion (or shape) parameter k.
Mathematically, X~Poisson(λ F + λ V ) on the condition that λ V~G amma(mean = R V , dispersion = k). Then, the PGF g D (�) is defined as shown in Eq (1): By identifying the PGF g D (�), we find that the distribution of X was a Delaporte distribution, denoted by f D (�), with parameters R F , R V , and k.
If we define R = R F + R V , R is the population reproduction number as the expected (or average) number of secondary cases caused by a (typical) primary case [52,53], and thus we have is the expectation function. The R F and R V account for the fixed and variable components of the reproduction number (R), and thus we have which is the mean of the Delaporte distribution f D (X). As such, X F and X V are components of the (observable) number of offspring cases X, λ F and λ V are components of the (latent) individual reproduction number λ, which is a variable, and R F and R V are components of the population reproduction number R, which is considered as a constant. In particular, the distribution function of λ has both a discrete part and a continuous part.

Delaporte distribution.
Under the formulation of a Delaporte distribution [48], the probability mass function (PMF) f D (X) has three parameters, R F , R V , and k, and is given in Eq (2).
Here, Γ (�) denotes the Gamma function, and the integer x denotes the number of secondary cases. Eq (2) can be considered as a 'convolution' between an NB distribution and a Poisson distribution. Compared to the classic NB distribution proposed in [29], the Delaporte distribution can be restricted to an NB distribution if R F = 0, or equivalently R V = R. Similarly, if R V = 0 or k ! 1, the Delaporte distribution is restricted to a Poisson distribution [49]. Thus, either the NB or Poisson distribution is a special case of Delaporte distribution. Let the fraction of the fixed component ρ be defined as ρ = R F / R, and straightforwardly, we have 0 � ρ � 1. Equivalently, f D (X) in Eq (2) can also be formulated in an alternative version by replacing R F with ρR and R V with (1ρ)R, which is expressed in Eq (3), Here, the three parameters for the Delaporte distribution change to R, ρ, and k. As such, the Delaporte distribution becomes a Poisson distribution when ρ = 1, or a NB distribution when reflects the scale of variation in individual infectiousness, a smaller value for either ρ or k indicates a higher level of transmission heterogeneity or superspreading potential.
The implementation of Delaporte distribution is considered a generalization of the framework proposed in [29], and thus the interpretation of the dispersion parameter k generalizes its meaning in the NB distribution [45]. As the fixed part (R F ) of R vanishes in the NB distribution, 1= ffi ffi ffi k p is the coefficient of variation (CV) of the Gamma distribution followed by the individual reproduction numbers (λ). In the context of the Delaporte distribution, the effect of k on shaping the variation of λ is restricted to the CV of its variable part (λ V ), which is also Gamma-distributed.
Differences in the PMF of Poisson, NB, and Delaporte distributions are shown in Fig 1.

Epidemiological measurements of heterogeneity in transmission.
In epidemiological studies [3,4,17,54], the heterogeneity in disease transmission is frequently reported as a general '20/80' rule [21,55], that is, according to the Pareto principle, whereby 20% of primary cases cause 80% of secondary cases [56]. With the three parameters of the Delaporte distribution, the transmission distribution profiles can be translated in the form of the '20/80' rule. Following the framework proposed in [3,57], the proportion (0 � Q � 1) of secondary cases can be determined by the transmission contributed by a proportion (0 � P � 1) of the most infectious primary cases [33], and vice versa, which was formulated in Eq (4).
Here, b�c denotes the floor function, which outputs the largest integer less than or equal to the given number. Note that the X 1 x¼0 z � f D ðX ¼ xÞ at the denominator is the mean of the Delaporte distribution, i.e., R F + R V or R. Conventionally, Q is fixed at 0.8, and the value of P is of interest. A smaller P indicates that a smaller but core proportion of high-risk cases may generate most offspring cases, indicating a higher level of heterogeneity in transmission.
Generally, Q is considered a function of P, which is bound between 0 and 1 for both Q and P. The concaveness of this 'Q-P' function is positively related to the level of transmission heterogeneity [29], which is constructed in the same manner as the Lorenz curve [58,59]. For a perfectly homogeneous scenario, where X = R is a constant, we have Q = P.
Another important measurement of transmission heterogeneity is the proportion of primary cases that generate 0 secondary cases, which is given as f D (0) = f D (X = 0) based on Eqs (2) or (3). With the reproduction number R fixed, a larger value of f D (0) implies a higher level of heterogeneity in transmission.

Datasets
We adopted six sets of contact tracing data and extracted the observations of offspring cases generated by each seed case for further exemplification. These included five COVID-19 datasets collected in mainland China (dataset #1), South Korea (datasets #2a and b), Hong Kong (dataset #3), and Tianjin, China (dataset #4), and one SARS dataset collected in Beijing, China (dataset #5). The transmission chains within each dataset were screened and then reconstructed with systematic and strict 'inclusion-and-exclusion' screening criteria based on plausible epidemiological evidence and rigorous consistency checks. All datasets were previously published and adopted for analysis in peer-reviewed studies.
Dataset #1 contains 1407 transmission pairs that were identified and reconstructed in previous studies, governmental news release, and official situation reports from 15 January to 29 February 2020 in mainland China. We identified 807 infectors with at least one secondary case and extracted the number of offspring infectees generated by each infector. A total of 1241 sporadic or terminal cases with 0 secondary cases were identified. Thus, dataset #1 includes observations of secondary case numbers with a sample size of 2048.

Datasets #2a and #2b: COVID-19 data in South Korea.
For datasets #2a and #2b, we used the COVID-19 contact tracing data published in [33], which were shared by the authors. Both datasets shared the same source of information from the local public health authorities in South Korea, excluding the Daegu-Gyeongsangbuk region, where the data were not publicly reported.
Referring to [33], the original dataset was divided into different periods according to the onset dates of infectors. Dataset #2a contains 571 infectors with at least one secondary case and 830 sporadic or terminal cases during the epidemic period from 20 April to 16 October 2020. Dataset #2b contains 104 infectors and 240 sporadic or terminal cases occurring during the epidemic period from 19 January to 19 April 2020. As such, datasets #2a and #2b include observations of secondary case numbers with sample sizes of 1401 and 344, respectively.

Dataset #3: COVID-19 data in Hong Kong.
For dataset #3, we used the COVID-19 contact tracing data published in [17], which was accessed freely via public repository, https:// github.com/dcadam/covid-19-sse/blob/master/data/transmission_pairs.csv. Dataset #3 contains 169 transmission pairs that were identified and reconstructed according to governmental news releases and official situation reports published on 7 May 2020 in Hong Kong [60,61]. There were 91 infectors, 153 terminal cases, and 46 local sporadic cases identified, and we extracted information on the number of offspring infectees generated by each infector. As such, dataset #3 included observations of secondary case numbers with a sample size of 290 cases.

Dataset #4: COVID-19 data in Tianjin, China.
For dataset #4, we used the COVID-19 contact tracing data published in [19], which was freely obtained from the supplementary materials, accessed via https://www.mdpi.com/1660-4601/17/10/3705/s1. Dataset #4 contained 36 clusters of cases, including 47 cases of COVID-19, which were identified and reconstructed according to a governmental news releases and official situation reports between 21 January and 26 February 2020 in Tianjin, China [62], and each cluster was caused by a primary case. We identified seven infectors with 11 associated terminal cases and 29 local sporadic cases. Thus, dataset #4 contains observations of secondary case numbers with a sample size of 47.

Dataset #5: SARS data in Beijing, China.
For dataset #5, we used the SARS contact tracing data of superspreading events from April to May 2003 previously published in [5], which was also attempted to estimate the dispersion parameter in [29]. The 34 cases in the first and second generation were considered the source cases, and we extracted the number of offspring infectees generated by each source case. Thus, dataset #5 contained observations of secondary case numbers with a sample size of 34.

Likelihood framework and statistical inference
We considered the number of secondary cases observed from each primary case with a sample size N. Considering the infector who generates j (� 0) secondary cases, or equivalently a cluster of cases with size (j + 1) in one transmission generation, we denoted the number of these infectors by n j . Then, similar to previous studies [3,17], the likelihood of observing n j clusters with size (j + 1) was f D ðX ¼ jÞ ½ � n j . Thus, we construct the overall log-likelihood function, ℓ, in Eq (5).
To match the real-world observations, we adopted a Bayesian fitting procedure with a Metropolis-Hastings Markov chain Monte Carlo (MCMC) algorithm with non-informative prior distributions for parameter estimation. Based on the likelihood in Eq (5), the MCMC was conducted with five chains and 100,000 iterations for each chain, including 40,000 iterations for the burn-in period, to obtain the posterior estimates. The convergence of each MCMC chain was visually checked using trace plots and the Gelman-Rubin-Brooks diagnostic quantitatively [63]. The median and 95% credible intervals (95%CrI) of the posterior distributions of R F , R V , and k were calculated and summarized for comparison with the previous estimates and across each dataset.
For comparisons with the classic Poisson or NB framework, we also repeated the estimation procedures by restricting R F = R (i.e., R V = 0) for the Poisson distribution, or R F = 0 (i.e., R V = R) for the NB distribution.

Evaluation of fitting and testing performance
In accordance with previous study [17], the Akaike information criterion (AIC) of MLE was used to measure the fitting performance of the Poisson, NB, and Delaporte distributions. Statistical evidence supporting the improvement in the fitting performance is claimed when the AIC units are reduced by 2 or more [40,64].
The likelihood ratio (LR) test was adopted to assess the statistical significance of the improvement (in goodness-of-fit) of the Delaporte distribution versus the NB distribution. The test statistic (π � ) of the LR test was given as follows.
where L NB denotes the likelihood of the NB distribution and L denotes the likelihood of the Delaporte distribution. Therefore, the p-value was calculated as the percentile of the Chisquared distribution with degree of freedom df = 1 [11], which was expressed as follows: Here, pChi (�) denotes 1 minus the cumulative distribution function (i.e., survival function) of the Chi-squared distribution. Similar frameworks have also been adopted in previous studies [46,[64][65][66]. We considered p-value < 0.05 as a statistically significant improvement of the Delaporte distribution compared to the NB distribution, and thus the Delaporte distribution was selected as an optimization. Note that this appears statistically equivalent to having a significant estimate of 0 < ρ < 1, or both R F and R V > 0. To test performance, the power and type I error of the LR test were evaluated. The testing power is calculated as the probability of p-value < 0.05 for fitting Delaporte distribution to the real-world observations compared to the NB distribution. We generated pseudo-datasets with different sample sizes by random sampling with replacement, a method similar to nonparametric bootstrapping, from the datasets described in Section 2.2. The type I error rate was calculated as the probability of p-value < 0.05 for fitting Delaporte distribution to the NB distributed datasets against the NB distribution. We generated the NB-distributed datasets with Monte Carlo random sampling from NB distributions. Note that statistically, the pvalue < 0.05 from the LR test here was (roughly) equivalent to the AIC-based model selection with a cutoff of 2 units.
The parameter estimation of NB, and Delaporte distribution was obtained for each pseudoor NB-distributed dataset using the approach described in Section 2.3. We summarized the test statistic (π � ), power, and type I error rate based on the different sample sizes.

Extension of other types of real-world observations
Although helpful in estimating superspreading potentials, the number of offspring cases per index case in our dataset section was not always accurately reported [46]. In many situations, it is time or financially consuming for surveillance procedures to collect these datasets [67], and it is also difficult to maintain the consistency of reporting standards or secure sufficient samples [68]. Alternatively, the cluster size of next transmission generation, i.e., the one-generation cluster size, and the final outbreak size including a few seed cases are also commonly adopted to inform the characteristics of transmission. Thus, the theoretical frameworks in the following two sections were formulated to associate both types of real-world observations with the Delaporte distribution.
2.5.1 Next-generation cluster size. Cluster size data are frequently adopted to construct a statistical estimation [3,40,66]. Each one-generation cluster size observation is reported as the numbers of primary and secondary cases within a single transmission generation, which can also be simply translated into a number of primary cases and the cluster size of next-generation secondary cases. We discuss below the mathematical formulation of the distribution and likelihood function of a next-generation case cluster produced by a certain number of seed cases.
For a one-generation cluster of cases with size (i + j), that is, within a single transmission generation, where i (> 0) infectors generate j (� 0) infectees, we consider the summation of i independent and identically distributed (IID) random variables following the Delaporte distribution. Then, given the values of R F , R V , and k, the probability of observing an event in which i (� 0) infectors generate j (� 0) infectees can be formulated by employing the probability generating function (PGF) g D (�) in Eq (1). Thus, the PGF of the PMF of infectees number (j) generated by i infectors, h D (�), was as follows: By identifying the PGF G(�), we found that the distribution of the number infectees j generated by i infectors was also a Delaporte distribution, h D (j|i), with the parameters R F i, R V i, and ki, which was formulated as in Eq (6).
Alternatively, h D (�) in Eq (6) could also be transformed by replacing R F i with ρRi and R V with (1ρ)R, which was expressed as follows, It should be noted that for the new Delaporte distribution here, or in Eq (6), the fraction of fixed component (ρ) holds unchanged. As such, the likelihood function can be directly constructed by rearranging Eq (6) when one-generation cluster size observations were used to infer superspreading characteristics, that is, ρ and k. When ρ approaches 0, the Delaporte distribution reduces to the NB distribution [49], and thus the 'convolution' in the equation above vanished, i.e., a = j. Then, the distribution of the number of infectees j generated by i infectors was from the NB distribution (h NB ), which was also derived or adopted in previous studies [3,4,11,17,19,22,40,46,69]. Likewise, by using the branching process approach to characterize the size distribution introduced in [40,69,70], the formulation of Eq (6) can also be derived by obtaining the j-th derivative of g D (�) at 0 according to the property of PGF [71], which means the following relationship holds.
which can be shown algebraically or by mathematical induction (details omitted).

Final outbreak size with subcritical transmission.
Many outbreaks occur in the form of isolated cases, short chains of transmission, or small clusters [3,72], for example, diseases with weak human-to-human transmission [68] or vaccine-preventable infections in a vaccine-available setting [73]. Thus, offspring cases observations like those in our data section are limited and difficult to access because the transmission is unlikely to be sustained. These outbreaks are recognized as subcritical (or self-limited) outbreaks when the population reproduction number appears to be less than 1 [11,69], that is, R < 1, namely a weakly transmitting disease. Although the final outbreak size is frequently linked to subcritical transmission, the final outbreak size may also be observable for supercritical transmission (R > 1), which we will introduce below more rigorously. Each self-limited outbreak includes a group of cases connected by an unbroken series of transmission events (or chains), which was named the 'stuttering transmission chain' in [11].
Except for the first i seed (or imported) cases, each case in a self-limited outbreak must be produced by one of the total cases with size denoted by c. According to [11], each secondary case must be linked to one of the other cases. Thus, the probability of observing a stuttering chain (or self-limited outbreak) size c (� i) including i (> 0) cases is (i/c) and multiplies the probability of c primary cases causing (c-i) secondary cases in one generation, i.e., In other words, under the independent and identically distributed assumption of the branching process [71], the probability of having a stuttering chain of size c including i cases, denoted by ω D (c, i), is the (c The term i c is the normalization factor for the correction that i out of c cases are seed cases. This equation matches the relation derived in [40], which was also adopted in [57]. Rearranging the expression algebraically, we derive the exact formula of ω D (c, i) in Eq (7).
By replacing R F i with ρRi and R V i with (1ρ)Ri, an alternative version of ω D (c, i) was expressed as follows, Therefore, the likelihood function can be constructed based on Eq (7) when stuttering chain size observations are available. When ρ approaches 0, the Delaporte distribution reduces to the NB distribution [49], and thus a = c − i. Thus, the probability of observing the final outbreak size c including i cases based on the NB distribution (ω NB ), Alternatively, the form below of ω NB (c, i) was previously adopted, which was mathematically equivalent. . This formula was also adopted previously in [57]. As reported in [11,69], with adjustment, the formula in Eq (7) is also applicable for supercritical transmission. When R > 1, there is a chance of 1 À P 1 c¼i o D ðc; iÞ � � that the outbreak will never be extinct, which means the final outbreak size c becomes a defective random variable. Based on the property of the branching process, we may calculate the probability of outbreak extinction ε by solving ε = [g D (ε) i ] [69]. Thus, the likelihood function can also be constructed by adjusting ε as the denominator for supercritical transmission.
Of particular interest is the final size of the outbreak generated by single seed case, i.e., i = 1, which is the probability of c (� 1) primary cases causing (c − 1) secondary cases, i.e., h D (j = c − 1|i = c) = h D (c − 1|c), as in Eq (8).

Theoretical framework of different control schemes
We formulated the following two control schemes (I) and (II) with same reduction amount in reproduction number and compared their respective control efficacies in reducing the risks of superspreading [outcome (I)] or outbreak [outcome (II)]. For both schemes, we considered the control effect (ξ) in terms of the fractional reduction in the reproduction number (R), where ξ = 0 reflects no control and ξ = 1 reflects complete blockage of transmission.
Following [29], this control scheme (I) is expected to have the least efficacy in risk reduction and thus is treated as the baseline scenario.
In population-wide control measures, we consider that each individual reproduction number (λ) is reduced by a factor ξ (0 � ξ < 1) for fixed and variable components (λ F and λ V ), namely a relative reduction in the reproduction number. Then, on the population scale, the reproduction number (R) is also reduced by factor ξ, and thus the fixed and variable components become (1 − ξ)R F and (1 − ξ)R V , respectively. The controlled reproduction is (1 − ξ)R. Thus, the PMF of offspring cases (x) generated by one seed case is the following Delaporte distribution, f ð1Þ The superscript ' (1) ' is merely for labeling purposes rather than powering. For the final outbreak size (c � 1) generated by a single case under the control scheme (I), the PMF o ð1Þ D cjx ð Þ can be derived as follows,

Scheme (II): High-risk-specific control.
High-risk-specific control measures target individuals with higher risk of superspreading potentials, e.g., individuals who frequently travel and contact others, and staff members sharing common facilities in the workplace. Thus, interventive measures such as city lockdowns and travel bans [78,79], digital contact tracing at public places [80,81], and gathering restrictions may interfere with the potential risks of spreading the disease by targeting high-risk individuals.
High-risk-specific control measures prioritize the variable component of the individual reproduction number (λ V ). Despite λ F being unchanged, the value of λ V is reduced so that individuals with higher risks of superspreading are less likely to achieve their potential for spreading diseases. To guarantee comparability with the population-wide control scheme, we maintain that controlled reproduction is (1 − ξ) R, and thus the value of R V reduces ξR units. Then, on the population scale, the reproduction number (R) is reduced by factor ξ. In the scenario that ξR > R V , equivalently ξ > R V / R = 1ρ or ξ + ρ > 1, the reduction will lead to R V = 0, the remaining amount (ξR − R V ) for the reduction is then passed to the fixed component R F , and the Delaporte distribution reduces to the Poisson distribution with rate R F − (ξR − R V ) = (1 − ξ)R. Thus, the PMF of offspring cases (x) generated by one seed case is formulated as follows, f ð2Þ The superscript ' (2) ' is merely for labeling purposes instead of powering.
For the final outbreak size (c � 1) generated by a single case under the control scheme (II), the PMF o ð2Þ D cjx ð Þ can be derived as follows, In particular, when the Delaporte distribution is restricted to the NB distribution, the distributions f ð1Þ

Risk outcome (I): Superspreading event.
The superspreading event is defined as the situation where an index case produces more secondary cases than the superspreading threshold (y). Following [29], when given R, the superspreading threshold y is calculated as the 99th percentile of the Poisson distribution with rate R [17]. Mathematically, y satisfies Pr(X � y | X~Poisson(R)) = 0.99. For example, with the reproduction number in the range from 1.5 to 3 for COVID-19 [41,[82][83][84][85], the superspreading threshold (y) ranges from 5 to 8 secondary cases.
Because y can be determined for a given R, the risk of having a superspreading event is the probability that a seed case generates offspring cases equal to or greater than the superspreading threshold. When the control measures have no effect on reducing the reproduction number, i.e., ξ = 0, the risk of superspreading event r D is Under control schemes (I) and (II), the risks of a superspreading event are as follows.
respectively. Therefore, the control efficacies can be compared within or between control schemes given the same values of R or ξ.

Risk outcome (II): Large-scale outbreak.
A large-scale outbreak is defined as an outbreak with a final size (c) greater than 100, of which the threshold was adopted in [3,29,33]. Seeded by an index case, the final outbreak size c (� 1) is modelled in Eq (8) and is translated into h ð1Þ D cjx ð Þ and h ð2Þ D cjx ð Þ under control schemes (I) and (II), respectively. When ξ = 0, the risk of large-scale outbreak r D is Under control schemes (I) and (II), the risks of large-scale outbreak are respectively.

Control efficacy.
To compare different control strategies, the relative reduction in risk or relative efficacy approach was adopted [35]. For overdispersed transmission, most infected individuals do not contribute to the expansion of the epidemic, the final size of the outbreak could be drastically controlled by preventing relatively rare superspreading events [29]. Therefore, we measure the efficacy of control as the relative risk reduction (RRR) of having a superspreading event or leading to a large-scale outbreak in each seed case. As such, the following calculation applies to both risk outcomes (I) and (II).
Given R, the RRRs of control schemes (I) and (II) are respectively. As such, both RRR (1) (ξ) and RRR (2) (ξ) should be interpreted as the control efficacy when there is a reduction in R by factor ξ against that there is no change in R.

Results and discussion
By definition, the Delaporte distribution allows the decomposition of the individual reproduction number (λ) into two independent and additive components (i.e., λ F and λ V ). Although the offspring cases (X F ) generated from the λ F part are variable, the fixed component λ F = R F is constant. In contrast, the variable component λ V is a Gamma-distributed variable that accounts for the differences between individual cases and shares the same definition and interpretation as in the NB distribution [29,45]. As a generalization of the NB distribution, the Delaporte distribution appears different from the Poisson and NB distributions given the same mean R and dispersion k (see Fig 1), which is due to the effect of the additional parameter ρ. The term ρ quantifies the fraction of the mean reproducibility that is fixed (or the same) across different cases. The classic NB model restricted the fixed (baseline) fraction λ F to be 0, indicating that there must be a proportion of individuals with (almost) 0 transmissibility, which appears unrealistic. Conversely, the Delaporte distribution allowed λ F to be a non-negative value, which is more flexible for complex situations. Theoretically, a lower value of either ρ or k indicates a higher scale of variability in individual infectiousness [29], that is, variance in the distribution of offspring. With other parameters fixed, a smaller ρ leads to a larger (smaller) proportion of the most infectious primary cases (P) that produce the most (zero) secondary cases (Figs 2 and 3). The consistent negative relationship between ρ and superspreading potential was demonstrated, and this relationship appears stronger as k decreases. The most heterogeneous transmission occurs when both k and ρ are small, and the Delaporte distribution approaches the NB distribution. With the same R and k, the Lorenz curve of the Delaporte distribution falls between those of the Poisson and NB distributions (Fig 4), where the position of the Delaporte distribution depends on ρ.
Fitting to several datasets of offspring (or secondary) cases, our estimates of NB parameters were consistent with previous studies (Table 1). When the R F estimate was greater than 0 for the Delaporte distribution, the dispersion k estimate became greater than the k estimate of the NB distribution. We found that the Delaporte distribution led to an improved or equivalent fitting performance compared to the NB distribution in terms of AIC values. The improvement in fitting performance was also reflected by the estimates of R F , or equivalently ρ (not shown as the main result). When the sample size is large, for example, datasets #1-#3, the Delaporte distribution has a higher goodness-of-fit in terms of likelihood values. The Delaporte distribution more accurately captures the observed offspring data than the NB distribution (Fig 5). In datasets #1-#3, the high-density regions of posterior distributions of ρ were roughly skewed from 0.1 to 0.5. However, the improvement in explaining the real-world dataset becomes weak, or even not evident as sample size decreases, for example, datasets #4 and #5, where the NB distribution also yields satisfactory fitting performance. For datasets #2b and #2a, collected from 19 January to 19 April 2020 and from 20 April to 16 October 2020, respectively. It is worth noting that the estimated medians of ρ increased from 0.21 to 0.56, while k only had minor changes. With the same scales of k and R, the increase in ρ would lead to a decrease in the overdispersiveness of disease transmission, as well as a reduction in the risk of  superspreading. This finding was consistent with the conclusion in [33], which also discussed the impact of various local nonpharmaceutical interventions on the transmission characteristics of COVID-19 in South Korea.  The likelihood ratio (LR) test has been proposed for model selection between the NB and the Delaporte distributions [11,66], and yields satisfactory testing performance. We found an increasing statistical power of the LR test for identifying the improvement of Delaporte distribution as the sample size increased. The simulation results of the testing power show consistent trends as observed in datasets #1-#5 (Fig 6A). To secure a power larger than 0.80, surveillance may require a sample size above 400, see Fig 6B. Although the type I error rate appears slightly high around 0.03 when sample size ranges from 100 to 300 (Fig 6D-6E), while the type I error rate is generally conservative for a wide range of sample sizes from 30 to 3000 (Fig 6F). Similar non-monotone trends of the type I error rate have also been previously reported for other testing purposes [40]. The testing performance of increasing power and conservative type I error suggest that the LR test is informative in capture the true characteristics of over-dispersed offspring distribution with a low chance of false alarms.
In practical analysis, one may also be interested in obtaining estimators for R and k given the parameter estimates of the Delaporte distribution. Because the closest theoretical formula may be complex to derive, a convenient approximation using moments of the Delaporte distribution could be considered. To distinguish the dispersion parameters, we denote k NB and k D for the NB and Delaporte distributions, respectively. For a given Delaporte distribution, the first moment (i.e., mean) is R F + R V , and the second central moment (i.e., variance) is Thus, if let the NB distribution have the same value of mean and variance, for the approximated NB distribution, we have b Although this approximation can be directly calculated rapidly, by using the estimates of the example offspring datasets, we note that c k NB here appears slightly lower than the posterior estimates of k NB in Table 1.
The real-world datasets adopted in this study were offspring cases per seed case observations, but more generally, the Delaporte distribution can be extended to describing one-generation cluster or final outbreak size observations. For the one-generation cluster size j distribution, we derived that h D (j|i) also follows a Delaporte distribution with parameters not only determined on the original parameter set of f D (X) but also by the number of seed cases i. Specifically, f D (X) can be translated into h D (j|i) by multiplying parameters ρ and R by i, see Eq (6). A previous study determined that one-generation cluster size follows a NB distribution h NB (j|i) under the NB-distributed offspring assumption [40], which is similar to our extension of this finding to the situation of the Delaporte distribution. To assess the impact of ρ on disease outbreaks, the final outbreak size c distribution can be used to evaluate pandemic potentials seeded by i source (or imported) cases [2,38,73,86]. Thus, ω D (c, i)was derived in Eq (7), and appeared to be an extension of the NB version ω NB (c, 1) in [3,33], see the special case of Eq (8).
To illustrate the translation from the final outbreak size probability in Eq (7) to the likelihood-based estimation, we adopted the final outbreak size observations of the Middle East respiratory syndrome coronavirus (MERS-CoV infection in the Middle East region, which was reported in [87]. The dataset has a sample size of 55 outbreaks, including a total of 104 laboratory confirmed MERS cases, and all final outbreaks were seeded by single cases, as also summarized and studied in [3]. Hence, Eq (8) was used to construct the likelihood function for the Delaporte distribution. We estimated R F at 0.17 (95%CrI: 0.03, 0.45), R V at 0.32 (95% CrI: 0.01, 1.53), and k at 0.04 (95%CrI: 0.00, 0.19) with an AIC of 114.60. We also repeated the estimation using the NB distribution, which leads to R at 0.47 (95%CrI: 0.30, 0.78) and k at 0.27 (95%CrI: 0.10, 0.98) with an AIC of 115.68. For the previous estimates using NB in [3], it was estimated that R was 0.47 (95%CrI: 0.29, 0.80) and k was 0.26 (95%CrI: 0.09, 1.24), which was in line with our estimates. The k estimate appears lower in the Delaporte distribution, and the ρ estimate at 0.33 (95%CrI: 0.05, 0.98) was greater than 0, thus the fixed part of R was evident, which was also indicated by the difference in the AIC values.

PLOS COMPUTATIONAL BIOLOGY
Aside from the impact of k in determining the probability of risk outcome (I): in superspreading events, as described in [3,29], the parameter ρ also has an similar impact, and further influences the efficacy of different control strategies. With the same among (ξ) of reduction in R, the control efficacies (RRR) of both population-wide and high-risk-specific control schemes increased with ξ (Fig 7). To compare the two control schemes, we found that the control scheme (II) has a higher control efficacy than scheme (I) in terms of the RRR of superspreading event, i.e., RRR (2,1) (ξ). Effective control efforts may allow us to anticipate highly infectious source cases or the contexts in which a seed case may likely expose many susceptible individuals in advance. Then, the scale of the variable component of the reproduction number was reduced efficiently under the control scheme (II), such that a substantial proportion of superspreaders can be controlled. With ξ < 1 − ρ, the general (or linear) tendency of RRR (2,1) (ξ) increased rapidly as ξ or ρ increased (Fig 8). The largest value of RRR (2,1) (ξ) can be reached when ξ is close (but not necessarily approaching) to 1 − ρ. When ρ = 0, we illustrated that RRR (1) (ξ) = RRR (2) (ξ) (Fig 7A-7D), which indicated that RRR (2,1) (ξ) = 0. In other words, with the effects of ρ (> 0), the outperformance of high-risk-specific control scheme may become evident in terms of achieving RRR (2,1) > 0 for some values of ξ (Fig 8). In each panel, the dispersion parameter k is fixed at 0.2, the shading region indicates the situation that ξ � 1 − ρ, and the bold red segment highlights the range of ρ from 0.1 to 0.5, which characterizes the feature of COVID-19. In each panel label, 'R' is the reproduction number, and 'reduction in R' is the relative reduction in reproduction number (ξ). The 'NB' in the horizontal axis label stand = s for negative binomial (distribution). https://doi.org/10.1371/journal.pcbi.1010281.g008 For effective control strategies aiming to reduce the risk of outcome (II): large-scale outbreak, the RRR was determined by ξ, ρ, and R. Consistent with the trends of risk outcome (I) in Fig 7, a large-scale outbreak was less likely to occur as ξ increased despite control schemes (Fig 9). When ρ = 0, we illustrated that RRR (1) (ξ) = RRR (2) (ξ), see Fig 9A-9E, which indicated RRR (2,1) (ξ) = 0. Unlike SSE, the population-wide control scheme outperformed the high-riskspecific control scheme with RRR (2,1) < 0 when R was large and ξ was small, but the direction (or sign) may change to RRR (2,1) > 0 for small R or large ξ (Fig 10). On one hand, the highrisk-specific control scheme was more effective in reducing the outbreak risks under subcritical transmission. In self-limited (or stuttering) outbreak, although SSEs rarely occur, they have a significant contribution to the expansion of transmission [57]; thus, the risk of outbreak can be drastically reduced by targeting high-risk individuals [36]. On the other hand, this implied that when the epidemic curve is growing in terms of reproduction numbers larger than 1, a substantial proportion of transmission is due to the fixed part (λ F = R F ) of individual infectiousness, that is, subspreading events [88]. Despite the variable part R V , a large R F results in stable reproducibility of infections, and RRR (2,1) < 0 with a moderate scale of ρ (from 0.1 to 0.5 for COVID-19) (Fig 10T). Therefore, population-wide interventions may successfully control disease transmission on a general scale before the implementation of high-risk-specific control strategies subsequently.
Conversely, under extremely intensive control measures in terms of ξ ! 1, the chance of large-scale outbreak diminishes despite different control schemes. For example, mainland China has achieved satisfactory COVID-19 control outcomes [89]. Although Chinese authorities relaxed population-wide policies in recent months, high-risk-specific control measures secured intensive and compulsory digital contact tracing efforts to monitor the risks of infection at the level of an individual's daily routine [90,91]. In our theoretical framework, this indicates a high value of ξ for control scheme (II), which leads to a remarkably low risk of outbreaks (Fig 9).
This study has limitations. First, although the Delaporte distribution is a theoretical generalization of the NB distribution, our data analysis focused on determining whether there is statistical evidence supporting the improvement in fitting performance without investigating the mechanistic side of the decomposition of the reproduction number. For example, populationlevel factors such as contact size and frequency (e.g., household size) [25], and heterogeneity of population density, or individual-level factors such as biological determinants (e.g., evolutionary adaptation and in-host viral kinetics) [92,93], behavioral or social factors [32], and lifestyle habits might contribute to establishing superspreading potentials [29,40]. Second, with regard to the parameter estimation part, we assumed that all offspring observations were accurately reported without selection bias, which might not always be acceptable [85,[94][95][96][97]. In cases of considerable reporting or selection bias, adjustments on statistical inference can resolve such issues to some extent by modifying the likelihood framework, for example, by truncation and compounding [11,46,57]. Lastly, for the evaluation of control effects, although the final outbreak size (c) distribution was formulated under two schemes, we failed to find an analytical form for the condition with respect to R and ξ, such that RRR (2,1) > 0 or otherwise. Instead, we performed numerical simulations to check the sign of RRR (2,1) (shown visually in Fig 10), regarding the most feasible parameter ranges of COVID-19. Hence, the Delaporte distribution needs to be considered as a tool to monitor the three parameters to understand the transmission characteristics of infectious diseases and to provide information for strategic decisionmaking processes involving control measures.
In summary, as a generalization of the classic NB distribution, the Delaporte distribution can be adopted to decompose the reproduction number from the individual level to the population level and to characterize the transmission of infectious disease. The Delaporte distribution demonstrates statistical improvement in fitting the distributions of the real-world offspring cases' distributions against the NB distribution, and it presents increasing power and conservative type I error rates in detecting such an improvement in the goodness-of-fit with the LR test. Numerical simulation illustrated that the three parameters of the Delaporte distribution are important in understanding disease transmission characteristics and for advising of appropriate control strategies and providing new insights distinct from the NB model.

Declarations
Ethics approval and consent to participate. The COVID-19 contact tracing data were obtained from literature, which were originally collected via the public domains, and thus neither ethical approval nor individual consent was applicable.